A high-performance approach to string similarity using most frequent K characters

نویسندگان

  • Andre Valdestilhas
  • Tommaso Soru
  • Axel-Cyrille Ngonga Ngomo
چکیده

The amount of available data has been growing significantly over the last decades. Thus, linking entries across heterogeneous data sources such as databases or knowledge bases, becomes an increasingly difficult problem, in particular w.r.t. the runtime of these tasks. Consequently, it is of utmost importance to provide time-efficient approaches for similarity joins in the Web of Data. While a number of scalable approaches have been developed for various measures, the Most Frequent k Characters (MFKC) measure has not been tackled in previous works. We hence present a sequence of filters that allow discarding comparisons when executing bounded similarity computations without losing recall. Therewith, we can reduce the runtime of bounded similarity computations by approximately 70%. Our experiments with a single-threaded, a parallel and a GPU implementation of our filters suggest that our approach scales well even when dealing with millions of potential comparisons.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Novel String Distance Function based on Most Frequent K Characters

This study aims to publish a novel similarity metric to increase the speed of comparison operations. Also the new metric is suitable for distance-based operations among strings. Most of the simple calculation methods, such as string length are fast to calculate but doesn’t represent the string correctly. On the other hand the methods like keeping the histogram over all characters in the string ...

متن کامل

An Intelligent System for Exact Word Retrieval in Document Databases

Automatic Information retrieval from document image databases is an important and challenging task. The main challenges are font style, size and spacing between characters. In order to meet the challenges, we propose a new technique for matching exact word string from document databases. For this approach, we address two issues: word identification and similarity measurement between documents. ...

متن کامل

Selecting and Weighting N-Grams to Identify 1100 Languages

This paper presents a language identification algorithm using cosine similarity against a filtered and weighted subset of the most frequent n-grams in training data with optional inter-string score smoothing, and its implementation in an open-source program. When applied to a collection of strings in 1100 languages containing at most 65 characters each, an average classification accuracy of ove...

متن کامل

A new virtual leader-following consensus protocol to internal and string stability analysis of longitudinal platoon of vehicles with generic network topology under communication and parasitic delays

In this paper, a new virtual leader following consensus protocol is introduced to perform the internal and string stability analysis of longitudinal platoon of vehicles under generic network topology. In all previous studies on multi-agent systems with generic network topology, the control parameters are strictly dependent on eigenvalues of network matrices (adjacency or Laplacian). Since some ...

متن کامل

Automatic Construction of Weighted String Similarity Measures

String similarity metrics are used for several purposes in text-processing. One task is the extraction of cognates from bilingual text. In this paper three approaches to the automatic generation of language dependent string matching functions are presented. 1 I n t r o d u c t i o n String similarity metrics are extensively used in the processing of textual data for several purposes such as the...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017